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Abstract 

This paper presents an original way to add new 
data in a reference dictionary from several other 
lexical resources, without loosing any consis- 
tence. This operation is carried in order to get 
lexical information classified by the sense of the 
entry. This classification makes it possible to 
enrich utterances (in QA: the queries) following 
the meaning, and to reduce noise. An analysis 
of the experienced problems shows the interest 
of this method, and insists on the points that 
have to be tackled. 

1 Introduction 

Our society is currently facing an increasing 
amount of textual data, that no-one can store 
up or even read. Many automatic systems are 
designed to find a requested piece of informa- 
tion. All the current systems use dictionaries 
to identify data in texts or in queries. QA soft- 
wares, which are particularly demanding about 
data from the dictionary, have a similar mode 
of working: they process an utterance (gener- 
ally the query) in order to provide the largest 
number of way to express the same meaning. 
Then they try to find a match between the 
expanded utterance and a text. For example, 
flHull, 1999D expands synonymically the 'signif- 
icant' vocabulary of the question. QUALC 



(Ferret et al., 1999) adds stemming expansion 



prior to using a search engine. The Falcon sys- 
tem ( Moldovan et al., 2000 ) uses some semantic 



relations from WordNet when it expands the 
question. 

In this paper, I present a way to process 
dictionaries to make them consistent with the 
needs of the application. I first describe the 
lexical needs of my QA application. I secondly 
outline the issue of the use of several incom- 
patible dictionaries. Then I show the way I dis- 
tribute information from additional dictionaries 
to a reference one: synonyms, derivative forms 



and taxonomy. Finally I present the problems 
and difficulties I found. 

2 What the QA method needs 

The QA system flJacquemin, 2003| ) is based on 
a matching procedure between query and text 
segment. As most of the other approaches, my 
methodology solves the problem of the different 
ways to express the same idea by adding to the 
utterance ('enrichments') synonyms, derivatives 
or words belonging to the same taxonomy. 

My method entails two new features: First, it 
uses semantic disambiguation in order to choose 
the right meaning to each word in the sentences. 
I notice that most of the QA systems try to 
give as many enrichments as possible to a word 
rather than to a meaning. The answers often 
correspond to a sense different from the original 
one. But if each enrichment has the same sense 
as the original one, the noise decreases. 

The fact that a semantic disambiguator needs 
a large context to the word to be disambiguated 
d Weaver, 1949D provides the second feature: the 
query generally comprises few words. I decided 
to process the documents to build an enriched 
informative structure (Jacquemin, 2004). But 



this feature falls outside the scope of this paper. 

My semantic disambiguator 

( Jacquemin et al., 2002D is an evolution 
of a tool previously developed for both 
French and Eng lish at XRCE QBrun, 2000) 
Brun et al., 2001 ). The idea is to use a dictio- 



nary as a tagged corpus to extract semantic 
disambiguation rules. The contextual data 
(syntactic, lexico-syntactic and semantico- 
syntactic) for a given sense of a word are 
seen as differential indications. So when the 
schema is found in the context of this word in 
a sentence, the corresponding sense is assigned. 

In figure ^ we can see how a disam- 
biguation rule is extracted from an infor- 
mative field of Dubois' French dictionary 



( Dubois and Dubois-Charl icr, 1997). From the 



Example from Dubois' dictionary (entry: rem- 
porter): 

On remporte la victoire sur ses adversaires 

(sense nb 2 : gagner) 

We win a victory over our adversaries. 

Dependency containing remporter: 
VARG[DIR] (remporter , victoire) 

Corresponding disambiguation rule: 
remporter: VARG[DIR] (remporter , victoire) 
==> sense gagner 

Figure 1: Extraction of a semantic disambigua- 
tion rule. 



instance field of the entry remporter in its sec- 
ond sense gagner (to w in), the XIP parser 
(Ait-Mokhtar et al., 2002|) extracts a lexico- 



to complete Dubois' gaps. No synonyms dictio- 
nary shares out the synonyms by sense of the 



syntactic schema: VARG[DIR] means that the 
argument victoire is a direct object of the ar- 
gument remporter. The rule built from this de- 
pendency indicates that the sense of the word 
remporter, in a context where victoire (victory) 
is the direct object, is the second sense gagner 
(to win). Two other types of rules exists: the 
first type puts lexical rules into general use, re- 
placing lexical arguments by corresponding se- 
mantic classes. The other one uses syntactic 
schemas stipulated by the dictionary (for in- 
stance: transitive, reflexive, etc.). 

The dictionary needs of both QA system 
and semantic disambiguator are of two natures. 
First, the dictionary is required to share out 
data following sense and not following lemma: 
The data are differential indications. Second, 
the dictionary is required to contain contextual 
information as much as possible: examples or 
collocations (lexical rules), semantic classes or 
application domains (generalized rules), subcat- 
egorisation. . . The Dubois' dictionary yields to 
these demands, and moreover it contains some 
data that could be helpful to enrich an utter- 
ance: synonyms, instructions for derivations. . . 

3 Enrichment problems 

Several expansion solutions are proposed by the 
QA approaches: use of synonyms or taxonomy's 
members, stemming or use of derivatives. . . 

Dubois' dictionary contains some synonyms 
linked with a sense of the word they are syn- 
onymous with. But these synonyms are too 
few to provide sufficient enrichments. The sys- 
tem needs one or more synonyms dictionaries 



entry, except EuroWordNet ( Catherin, 1 999). 
But EuroWordNefs sharing out into senses 
does not match Dubois' senses. Thus the ques- 
tion is to distribute the available synonyms of 
each word to the right sense in Dubois'. 

The stemming, which considers two words 
with the same stem nearly synonyms, is too un- 
predictable to be used in a methodology that 
tries to avoid noise. As Dubois' provides in- 
structions to form derivatives from lemma and 
suffixes for some senses, the derivation is pre- 
ferred to the stemming. But the instructions are 
often vague, and indicate only the suffix to use 
and the new part-of-speech. It is not sufficient 
to be used automatically. Thus the derivation 
procedure needs an extra tool able to propose 
derivatives, including the right one. Dubois' in- 
formation is sufficient to filter and classify them. 

Finally, Dubois' does not provide taxonomy, 
and the French resources containing a seman- 
tic hierarchy do not supply contextual informa- 
tion. The taxonomy has to be found in another 
resource, which is not consistent with the ref- 
erence dictionary. The compatibility between 
senses of all these resources is the objective. 

4 How to make the dictionaries 
compatible 

The main difficulty is to share out information 
collected from extra dictionaries. The dictio- 
naries are incompatible with Dubois', but new 
data have to be distributed following the senses 
of the entries of the reference dictionary. 

4.1 Synonyms 

Three resources are at my disposal: Bailly's 



dictionary (Baill y, 1947 ), an electronic dictio- 
nary designed by Memodata, and the French 
EuroWordNet flCatherin, 19991 ■ The expansion 
methods commonly use all the available syn- 
onyms for a word, but my approach has to keep 
only the synonyms corresponding to the current 
sense of the word. For each considered sense for 
a word, Dubois' provides semantic features: a 
semantic class and an application domain. 

The synonyms from the extra dictionaries are 
proposals. A proposal for a lemma in Dubois' 
dictionary is kept for a given sense only if one 
sense at least of the Dubois' entry correspond- 
ing to the proposal matches the semantic fea- 
tures of the given sense. If no sense of the pro- 
posal matches the semantic features of the given 
sense, the proposal is rejected for this sense. 



Semantic features of the entry: 

Domain Class 
ravir (2) SOC S4 

Semantic features of synonyms: 

Domain Class 
charmer PSY P2 
voler SOC S4 

Figure 2: Selection of the synonyms. 



Derivatives for the verb couper: 
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Retained 


derivatives 


sense nb 1 


derivatives 
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coupage 


(age sense 5) 


removed 
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coupant 


ant 


coupant 



Figure 3: Selection of the derivatives. 



In figure I2 the problem is to determine which 
proposal matches the word ravir in the sense nb 
2 voler (to steal). The semantic features of this 
sense are the application domain SOC (sociol- 
ogy) and the semantic class S4 (to grip, to own). 
The proposal charmer, which features are PSY 
(psychology) and P2 (psychological verb) does 
not match the features of ravir 2. The proposal 
derober in its second and fourth senses has the 
same features. This proposal is confirmed for 
ravir in sense nb 2. It will be used as enrichment 
when the sense nb 2 of ravir is detected in an 
utterance by the semantic disambiguator. This 
procedure is applied for all the proposed syn- 
onyms for all the senses of each entry in Dubois'. 

4.2 Derivatives 

The derivation field in the Dubois' provides 
sufficient indications to recognize the stipu- 
lated derivatives of an entry in a determined 
sense. Thus, the need is a resource or a 
tool providing all the potential derivative from 
a word. Resources are rare and incomplete 
for French, but I have to my disposal a tool 



( Gaussier et al., 2 000 ) able to construct suffixal 
derivatives from a word. If the only constraint 
requires the derivatives belong to the lexicon, 
all the right suffixal forms are provided among 
the incorrect proposals. When all the propos- 
als are produced, the suffix of each proposal is 
compared with the instructions in the dictio- 
nary. When they match, the derivative is kept 
for the current sense. If not, the derivative is 
rejected. 

The figureOlshows the working of the method. 
For the verb couper in the sense trancher (to 
cut, to slice), Dubois' indicates derivatives with 
suffixes -ure: coupure (break), -ant: coupant 
(sharp) and -eur: coupeur (cutter). But no in- 
struction is given for a suffix -able. The wrong 
derivative coupable (guilty) is rejected. 



4.3 Taxonomy 

Only two resources containing taxonomy exist 
for French. AlethDic JGsi, 199 3) is known for 
its very bad quality. The hierarchy is neither 
very deep, nor very large. The semantic rela- 
tions are not strictly defined inside the hierar- 
chy. Because of this, I rejected AlethDic. 

The other resource is EuroWordNet. Two 
kind of taxonomic relations are defined: hyper- 
onymy (and hyponymy), and meronymy (and 
holonymy) . The other semantic relations of this 
resource fall outside the scope of this paper. 

The taxonomic relations link synsets to- 
gether. The synsets contain synonymous words 
for at least one of their senses. The taxonomy is 
usable by the QA system only if the sense of the 
whole synset can be identified, and if the sense 
matches at least one of the sense of the word 
under consideration in Dubois' dictionary. 

So each word in Dubois' has to be linked with 
a synset to be inserted into a taxonomic hierar- 
chy. That amounts to match senses in Dubois' 
and synsets in EuroWordNet. We already have 
some senses in Dubois' matching sets of syn- 
onyms in EuroWordNet. It is easy to use the 
additional synonyms from EuroWordNet to set 
up a correspondence between sense of Dubois' 
dictionary and synsets of EuroWordNet. 

The procedure is to examine all the synsets 
where a considered word appears. For each of its 
sense, if the majority of the synonyms obtained 
from EuroWordNet are contained in a synset, 
the meaning illustrated by the synset and this 
sense of the word are considered to be equiva- 
lent. In this case, the word under consideration 
is inserted in this place into the taxonomic hi- 
erarchy. Otherwise, the synset is not seen to 
match the sense, and it is rejected. 

5 Experienced problems 

The difficulties met differ for each kind of pro- 
cessed data. In the sharing out of the synonyms, 



the system cannot determine automatically the 
meaning of a multiword expression. Dubois' 
only lists single words, and no semantic feature 
can be allocated to a multiword expression. A 
multiword proposal is considered to have all the 
meaning of the word to which it is synonym. 

I have no real evaluation of this procedure: 
the division into senses of the reference dictio- 
nary is as always open to doubt. Considering 
the result, examiners never agree with each syn- 
onyms for a sense. But when they agree (three 
examiners where consulted), they where satis- 
fied by more than 80% of the synonyms. 

The derivation tool provides nearly all the 
derivatives from a word when no constraint is 
defined. Most of the wrong derivatives (about 
97%) are screened by the instructions supplied 
by Dubois' dictionary. However, these figure 
are not valid for short words: the tool is desig- 
nated in such a way that derivatives with a rad- 
ical shorter than 3 letters are generally wrong. 
Moreover, the instructions are often incomplete 
in the dictionary, above all nominal entries. 

The promising procedure using taxonomy, 
presented above, is still a suggestion. I am 
facing with the problem that EuroWordNet 
covers only a small part of the French lexi- 
con. A more proper trial should use WordNet 
( |Fellbaum, 1998 ) , that covers a huge part of the 
English lexicon, in an English QA application. 

6 Conclusion 

In this paper, I present an original method to 
merge several dictionaries in such a way that 
all the informative fields become semantically 
consistent. This need comes from an expansion 
method for QA, which uses as enrichment only 
the synonyms, derivatives, and taxonomy that 
match the sense determined by a semantic dis- 
ambiguator. The method to share out informa- 
tion is particular for each kind of enrichment. 
An analysis of the experienced problems shows 
the interest of the method, and insists on the 
points that still have to be tackled. 

In a more general perspective, it is known 
that no perfect dictionary exists. Each dictio- 
nary used by this method has gaps. The method 
used to mix information is filtering data from 
extra dictionaries by data from a reference dic- 
tionary, but errors in the reference are passed on 
to the added data. The right solution should be 
to use more than one reference dictionary. 
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